Now that we have a sense of what our data look like, we can get started with the data analysis.
Running the preprocessing
The next major function of the recipes package is prep().
This function updates the recipe object using the training data. It estimates the parameters for preprocessing (the quantities and statistics required by the steps for the variables) and updates the model terms, since some of the predictors may be removed. This makes the recipe ready to use on other datasets. It doesn’t necessarily execute the preprocessing itself; however, we will specify an argument for it to do this so that we can take a look at the preprocessed data.
There are some important arguments to know about:

1) training - you must supply a training data set to estimate parameters for the preprocessing operations (recipe steps) - this may already be included in your recipe, as is the case for us
2) fresh - if TRUE, will retrain and estimate parameters for any previous steps that were already prepped if you add more steps to the recipe
3) verbose - if TRUE, shows the progress as the steps are evaluated, as well as the size of the preprocessed training set
4) retain - if TRUE, the preprocessed training set will be saved within the recipe (as template). This is useful if you are likely to add more steps and don’t want to rerun prep() on the previous steps; however, it can make the recipe object large. It is necessary if you want to actually look at the preprocessed data.
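Putting these arguments together, the call looks something like this (a sketch; simple_rec and train are assumed names for the recipe object and training data from earlier in the case study):

```r
library(recipes)

# Estimate the preprocessing parameters from the training data.
# retain = TRUE keeps the preprocessed training set inside the
# prepped recipe so we can inspect it with juice() later.
prepped_rec <- prep(simple_rec, training = train,
                    verbose = TRUE, retain = TRUE)
```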
oper 1 step dummy [training]
oper 2 step corr [training]
oper 3 step nzv [training]
The retained training set is ~ 0.26 Mb in memory.
[1] "var_info" "term_info" "steps" "template"
[5] "levels" "retained" "tr_info" "orig_lvls"
[9] "last_term_info"
There are also lots of useful things to check out in the output of prep(). You can see:

1) the steps that were run
2) the variable info (var_info)
3) the model term info (term_info)
4) the new levels of the variables
5) the original levels of the variables (orig_lvls)
6) info about the training data set size and completeness (tr_info)
Note: You may see the prep.recipe() function in material that you read about the recipes package. This refers to the prep() function of the recipes package.
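To actually see the preprocessed training data, we can extract it from the prepped recipe with juice() and inspect it with glimpse() from the dplyr package (prepped_rec is an assumed name for the output of prep()):

```r
library(dplyr)

# Extract the preprocessed training data retained in the recipe
# (this requires retain = TRUE in the prep() call) and look at it.
preproc_train <- juice(prepped_rec)
glimpse(preproc_train)
```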
Rows: 584
Columns: 36
$ id <fct> 1003.001, 1027.0001, 1033.1002, 1055.001,…
$ value <dbl> 9.597647, 10.800000, 11.212174, 12.375394…
$ fips <fct> 1003, 1027, 1033, 1055, 1069, 1073, 1073,…
$ lat <dbl> 30.49800, 33.28126, 34.75878, 33.99375, 3…
$ lon <dbl> -87.88141, -85.80218, -87.65056, -85.9910…
$ CMAQ <dbl> 8.098836, 9.766208, 9.402679, 9.241744, 9…
$ zcta_area <dbl> 190980522, 374132430, 16716984, 154069359…
$ zcta_pop <dbl> 27829, 5103, 9042, 20045, 30217, 9010, 16…
$ imp_a500 <dbl> 0.01730104, 1.96972318, 19.17301038, 16.4…
$ imp_a15000 <dbl> 1.4386207, 0.3359198, 5.2472094, 5.161210…
$ county_area <dbl> 4117521611, 1564252280, 1534877333, 13856…
$ county_pop <dbl> 182265, 13932, 54428, 104430, 101547, 658…
$ log_dist_to_prisec <dbl> 4.648181, 7.219907, 5.760131, 5.261457, 7…
$ log_pri_length_5000 <dbl> 8.517193, 8.517193, 8.517193, 9.066563, 8…
$ log_pri_length_25000 <dbl> 11.32735, 10.12663, 10.15769, 12.01356, 1…
$ log_prisec_length_500 <dbl> 7.295356, 6.214608, 8.611945, 8.740680, 6…
$ log_prisec_length_1000 <dbl> 8.195119, 7.600902, 9.735569, 9.627898, 7…
$ log_prisec_length_5000 <dbl> 10.815042, 10.170878, 11.770407, 11.72888…
$ log_prisec_length_10000 <dbl> 11.886803, 11.405543, 12.840663, 12.76827…
$ log_nei_2008_pm10_sum_15000 <dbl> 2.26783411, 3.31111648, 6.70127741, 4.462…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.628728, 3.311116, 7.148858, 4.678311, 3…
$ popdens_county <dbl> 44.265706, 8.906492, 35.460814, 75.367038…
$ popdens_zcta <dbl> 145.7164307, 13.6395554, 540.8870404, 130…
$ nohs <dbl> 3.3, 11.6, 7.3, 4.3, 5.8, 7.1, 2.7, 11.1,…
$ somehs <dbl> 4.9, 19.1, 15.8, 13.3, 11.6, 17.1, 6.6, 1…
$ hs <dbl> 25.1, 33.9, 30.6, 27.8, 29.8, 37.2, 30.7,…
$ somecollege <dbl> 19.7, 18.8, 20.9, 29.2, 21.4, 23.5, 25.7,…
$ associate <dbl> 8.2, 8.0, 7.6, 10.1, 7.9, 7.3, 8.0, 4.1, …
$ bachelor <dbl> 25.3, 5.5, 12.7, 10.0, 13.7, 5.9, 17.6, 7…
$ grad <dbl> 13.5, 3.1, 5.1, 5.4, 9.8, 2.0, 8.7, 2.9, …
$ pov <dbl> 6.1, 19.5, 19.0, 8.8, 15.6, 25.5, 7.3, 8.…
$ hs_orless <dbl> 33.3, 64.6, 53.7, 45.4, 47.2, 61.4, 40.0,…
$ urc2013 <dbl> 4, 6, 4, 4, 4, 1, 1, 1, 1, 1, 2, 3, 3, 3,…
$ aod <dbl> 37.363636, 34.818182, 36.000000, 43.41666…
$ state_California <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ city_Not.in.a.city <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0,…
For easy comparison's sake, here is our original data:
Rows: 876
Columns: 50
$ id <fct> 1003.001, 1027.0001, 1033.1002, 1049.1003…
$ value <dbl> 9.597647, 10.800000, 11.212174, 11.659091…
$ fips <fct> 1003, 1027, 1033, 1049, 1055, 1069, 1073,…
$ lat <dbl> 30.49800, 33.28126, 34.75878, 34.28763, 3…
$ lon <dbl> -87.88141, -85.80218, -87.65056, -85.9683…
$ state <chr> "Alabama", "Alabama", "Alabama", "Alabama…
$ county <chr> "Baldwin", "Clay", "Colbert", "DeKalb", "…
$ city <chr> "Fairhope", "Ashland", "Muscle Shoals", "…
$ CMAQ <dbl> 8.098836, 9.766208, 9.402679, 8.534772, 9…
$ zcta <fct> 36532, 36251, 35660, 35962, 35901, 36303,…
$ zcta_area <dbl> 190980522, 374132430, 16716984, 203836235…
$ zcta_pop <dbl> 27829, 5103, 9042, 8300, 20045, 30217, 90…
$ imp_a500 <dbl> 0.01730104, 1.96972318, 19.17301038, 5.78…
$ imp_a1000 <dbl> 1.4096021, 0.8531574, 11.1448962, 3.86764…
$ imp_a5000 <dbl> 3.3360118, 0.9851479, 15.1786154, 1.23114…
$ imp_a10000 <dbl> 1.9879187, 0.5208189, 9.7253870, 1.031646…
$ imp_a15000 <dbl> 1.4386207, 0.3359198, 5.2472094, 0.973044…
$ county_area <dbl> 4117521611, 1564252280, 1534877333, 20126…
$ county_pop <dbl> 182265, 13932, 54428, 71109, 104430, 1015…
$ log_dist_to_prisec <dbl> 4.648181, 7.219907, 5.760131, 3.721489, 5…
$ log_pri_length_5000 <dbl> 8.517193, 8.517193, 8.517193, 8.517193, 9…
$ log_pri_length_10000 <dbl> 9.210340, 9.210340, 9.274303, 10.409411, …
$ log_pri_length_15000 <dbl> 9.630228, 9.615805, 9.658899, 11.173626, …
$ log_pri_length_25000 <dbl> 11.32735, 10.12663, 10.15769, 11.90959, 1…
$ log_prisec_length_500 <dbl> 7.295356, 6.214608, 8.611945, 7.310155, 8…
$ log_prisec_length_1000 <dbl> 8.195119, 7.600902, 9.735569, 8.585843, 9…
$ log_prisec_length_5000 <dbl> 10.815042, 10.170878, 11.770407, 10.21420…
$ log_prisec_length_10000 <dbl> 11.88680, 11.40554, 12.84066, 11.50894, 1…
$ log_prisec_length_15000 <dbl> 12.205723, 12.042963, 13.282656, 12.35366…
$ log_prisec_length_25000 <dbl> 13.41395, 12.79980, 13.79973, 13.55979, 1…
$ log_nei_2008_pm25_sum_10000 <dbl> 0.318035438, 3.218632928, 6.573127301, 0.…
$ log_nei_2008_pm25_sum_15000 <dbl> 1.967358961, 3.218632928, 6.581917457, 3.…
$ log_nei_2008_pm25_sum_25000 <dbl> 5.067308, 3.218633, 6.875900, 4.887665, 4…
$ log_nei_2008_pm10_sum_10000 <dbl> 1.35588511, 3.31111648, 6.69187313, 0.000…
$ log_nei_2008_pm10_sum_15000 <dbl> 2.26783411, 3.31111648, 6.70127741, 3.350…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.628728, 3.311116, 7.148858, 5.171920, 4…
$ popdens_county <dbl> 44.265706, 8.906492, 35.460814, 35.330814…
$ popdens_zcta <dbl> 145.716431, 13.639555, 540.887040, 40.718…
$ nohs <dbl> 3.3, 11.6, 7.3, 14.3, 4.3, 5.8, 7.1, 2.7,…
$ somehs <dbl> 4.9, 19.1, 15.8, 16.7, 13.3, 11.6, 17.1, …
$ hs <dbl> 25.1, 33.9, 30.6, 35.0, 27.8, 29.8, 37.2,…
$ somecollege <dbl> 19.7, 18.8, 20.9, 14.9, 29.2, 21.4, 23.5,…
$ associate <dbl> 8.2, 8.0, 7.6, 5.5, 10.1, 7.9, 7.3, 8.0, …
$ bachelor <dbl> 25.3, 5.5, 12.7, 7.9, 10.0, 13.7, 5.9, 17…
$ grad <dbl> 13.5, 3.1, 5.1, 5.8, 5.4, 9.8, 2.0, 8.7, …
$ pov <dbl> 6.1, 19.5, 19.0, 13.8, 8.8, 15.6, 25.5, 7…
$ hs_orless <dbl> 33.3, 64.6, 53.7, 66.0, 45.4, 47.2, 61.4,…
$ urc2013 <dbl> 4, 6, 4, 6, 4, 4, 1, 1, 1, 1, 1, 1, 1, 2,…
$ urc2006 <dbl> 5, 6, 4, 5, 4, 4, 1, 1, 1, 1, 1, 1, 1, 2,…
$ aod <dbl> 37.36364, 34.81818, 36.00000, 33.08333, 4…
Notice how we only have 36 variables now instead of 50! Two of these are our ID variables (fips and the actual monitor ID (id)) and one is our outcome (value). Thus we only have 33 predictors now. We can also see that we no longer have any categorical variables. Variables like state are gone, and only state_California remains, as it was the only state indicator with non-zero variance. We can see that California had the largest number of monitors compared to the other states. We can also see that there were more monitors listed as "Not in a city" than in any particular city.
Note: Recall that you must specify the retain = TRUE argument of the prep() function to use juice().
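To apply the same preprocessing to the test data, we use bake() on the prepped recipe (prepped_rec and test are assumed names from earlier):

```r
# Apply the preprocessing steps estimated on the training data
# to the held-out test data, then inspect the result.
baked_test <- bake(prepped_rec, new_data = test)
glimpse(baked_test)
```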
Rows: 292
Columns: 36
$ id <fct> 1049.1003, 1073.101, 1073.2006, 1089.0014…
$ value <dbl> 11.659091, 13.114545, 12.228125, 12.23294…
$ fips <fct> 1049, 1073, 1073, 1089, 1103, 1121, 4013,…
$ lat <dbl> 34.28763, 33.54528, 33.38639, 34.68767, 3…
$ lon <dbl> -85.96830, -86.54917, -86.81667, -86.5863…
$ CMAQ <dbl> 8.534772, 9.303766, 10.235612, 9.343611, …
$ zcta_area <dbl> 203836235, 148994881, 56063756, 46963946,…
$ zcta_pop <dbl> 8300, 14212, 32390, 21297, 30545, 7713, 5…
$ imp_a500 <dbl> 5.78200692, 0.06055363, 42.42820069, 23.2…
$ imp_a15000 <dbl> 0.9730444, 2.9956557, 12.7487614, 10.3555…
$ county_area <dbl> 2012662359, 2878192209, 2878192209, 20761…
$ county_pop <dbl> 71109, 658466, 658466, 334811, 119490, 82…
$ log_dist_to_prisec <dbl> 3.721489, 7.301545, 4.721755, 4.659519, 6…
$ log_pri_length_5000 <dbl> 8.517193, 9.683336, 10.737240, 8.517193, …
$ log_pri_length_25000 <dbl> 11.90959, 12.53777, 12.99669, 11.47391, 1…
$ log_prisec_length_500 <dbl> 7.310155, 6.214608, 7.528913, 8.760549, 6…
$ log_prisec_length_1000 <dbl> 8.585843, 7.600902, 9.342290, 9.543183, 8…
$ log_prisec_length_5000 <dbl> 10.214200, 11.262645, 11.713190, 11.48606…
$ log_prisec_length_10000 <dbl> 11.50894, 12.14101, 12.53899, 12.68440, 1…
$ log_nei_2008_pm10_sum_15000 <dbl> 3.3500444, 6.6241114, 5.8268686, 3.861625…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.1719202, 7.5490587, 8.8205542, 5.219092…
$ popdens_county <dbl> 35.330814, 228.777633, 228.777633, 161.26…
$ popdens_zcta <dbl> 40.718962, 95.385827, 577.735106, 453.475…
$ nohs <dbl> 14.3, 7.2, 0.8, 1.2, 4.8, 16.7, 19.1, 6.4…
$ somehs <dbl> 16.7, 12.2, 2.6, 3.1, 7.8, 33.3, 15.6, 9.…
$ hs <dbl> 35.0, 32.2, 12.9, 15.1, 28.7, 37.5, 26.5,…
$ somecollege <dbl> 14.9, 19.0, 17.9, 20.5, 25.0, 12.5, 18.0,…
$ associate <dbl> 5.5, 6.8, 5.2, 6.5, 7.5, 0.0, 6.0, 8.8, 3…
$ bachelor <dbl> 7.9, 14.8, 35.5, 30.4, 18.2, 0.0, 10.6, 1…
$ grad <dbl> 5.8, 7.7, 25.2, 23.3, 8.0, 0.0, 4.1, 5.7,…
$ pov <dbl> 13.8, 10.5, 2.1, 5.2, 8.3, 18.8, 21.4, 14…
$ hs_orless <dbl> 66.0, 51.6, 16.3, 19.4, 41.3, 87.5, 61.2,…
$ urc2013 <dbl> 6, 1, 1, 3, 4, 5, 1, 2, 5, 4, 4, 6, 6, 1,…
$ aod <dbl> 33.08333, 42.45455, 44.25000, 42.41667, 4…
$ state_California <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
$ city_Not.in.a.city <dbl> NA, NA, NA, NA, NA, NA, 0, NA, NA, NA, NA…
Notice that our city_Not.in.a.city variable seems to be mostly NA values. Why might that be?
Ah! Perhaps it is because some of our city levels were not previously seen in the training set!
Let’s take a look using the set operations of the dplyr package to see which cities differ between the test and training sets.
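A sketch of how this comparison can be done (train and test are assumed object names, and the city column is assumed to hold the city values):

```r
library(dplyr)

# The distinct cities in each data set
traincities <- train %>% distinct(city)
testcities  <- test %>% distinct(city)

# dim() shows how many cities there are in total across both sets,
# and how many appear in both the training and the test set
dim(dplyr::union(traincities, testcities))
dim(dplyr::intersect(traincities, testcities))
```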
[1] 376 1
[1] 51 1
Indeed, there are lots of different cities in our test data that are not in our training data!
Thus we need to update our original recipe to include a very important step function called step_novel(). This helps in cases like this, where there are new factor levels in our testing set that were not in our training set. It is a good idea to include this step in most of your recipes where you have categorical variables with many distinct values. This step needs to come before we create dummy variables. However, because we are also creating a dummy variable from this variable, that alone still results in a problem.
Let’s modify the city variable to take the values "In a city" or "Not in a city" using the if_else() function of the dplyr package. Alternatively, you could create a custom step function to do this and add it to your recipe, but that is beyond the scope of this case study.
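A sketch of the recoding, applied to the full dataset before splitting (pm is an assumed name for the combined data):

```r
library(dplyr)

# Collapse the many city levels down to two: "In a city" / "Not in a city".
pm <- pm %>%
  mutate(city = if_else(city == "Not in a city",
                        "Not in a city",
                        "In a city"))
```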
We need to create a new recipe to move forward, since the levels of our variables are established when the recipe is prepped. We would also potentially have this issue for state and county, so let’s do a similar thing for state. The county variable appears to get dropped due to either correlation or near-zero variance; it is likely near-zero variance, because county is the most granular of these geographic categorical variables and is therefore likely sparse.
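A sketch of the updated steps under these assumptions (pm is the assumed name for the full data; the split proportion and seed are assumptions chosen to match the 584/292 split shown below):

```r
library(dplyr)
library(rsample)

# Recode state to two levels as well, then re-split the data
pm <- pm %>%
  mutate(state = if_else(state == "California",
                         "California",
                         "Not California"))

set.seed(1234)  # assumed seed for reproducibility
pm_split <- initial_split(pm, prop = 2/3)
pm_split
```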
<Analysis/Assess/Total>
<584/292/876>
Now let’s retrain our training data and try baking our test data:
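Something like the following, assuming novel_rec is the name of the new recipe built from the modified data:

```r
library(recipes)

# Re-estimate the preprocessing parameters on the new training data,
# again retaining the preprocessed training set for inspection
prepped_novel_rec <- prep(novel_rec, verbose = TRUE, retain = TRUE)
```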
oper 1 step dummy [training]
oper 2 step corr [training]
oper 3 step nzv [training]
The retained training set is ~ 0.26 Mb in memory.
Rows: 584
Columns: 37
$ id <fct> 1003.001, 1027.0001, 1033.1002, 1055.001,…
$ value <dbl> 9.597647, 10.800000, 11.212174, 12.375394…
$ fips <fct> 1003, 1027, 1033, 1055, 1069, 1073, 1073,…
$ lat <dbl> 30.49800, 33.28126, 34.75878, 33.99375, 3…
$ lon <dbl> -87.88141, -85.80218, -87.65056, -85.9910…
$ CMAQ <dbl> 8.098836, 9.766208, 9.402679, 9.241744, 9…
$ zcta_area <dbl> 190980522, 374132430, 16716984, 154069359…
$ zcta_pop <dbl> 27829, 5103, 9042, 20045, 30217, 9010, 16…
$ imp_a500 <dbl> 0.01730104, 1.96972318, 19.17301038, 16.4…
$ imp_a15000 <dbl> 1.4386207, 0.3359198, 5.2472094, 5.161210…
$ county_area <dbl> 4117521611, 1564252280, 1534877333, 13856…
$ county_pop <dbl> 182265, 13932, 54428, 104430, 101547, 658…
$ log_dist_to_prisec <dbl> 4.648181, 7.219907, 5.760131, 5.261457, 7…
$ log_pri_length_5000 <dbl> 8.517193, 8.517193, 8.517193, 9.066563, 8…
$ log_pri_length_25000 <dbl> 11.32735, 10.12663, 10.15769, 12.01356, 1…
$ log_prisec_length_500 <dbl> 7.295356, 6.214608, 8.611945, 8.740680, 6…
$ log_prisec_length_1000 <dbl> 8.195119, 7.600902, 9.735569, 9.627898, 7…
$ log_prisec_length_5000 <dbl> 10.815042, 10.170878, 11.770407, 11.72888…
$ log_prisec_length_10000 <dbl> 11.886803, 11.405543, 12.840663, 12.76827…
$ log_prisec_length_25000 <dbl> 13.41395, 12.79980, 13.79973, 13.70026, 1…
$ log_nei_2008_pm10_sum_15000 <dbl> 2.26783411, 3.31111648, 6.70127741, 4.462…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.628728, 3.311116, 7.148858, 4.678311, 3…
$ popdens_county <dbl> 44.265706, 8.906492, 35.460814, 75.367038…
$ popdens_zcta <dbl> 145.7164307, 13.6395554, 540.8870404, 130…
$ nohs <dbl> 3.3, 11.6, 7.3, 4.3, 5.8, 7.1, 2.7, 11.1,…
$ somehs <dbl> 4.9, 19.1, 15.8, 13.3, 11.6, 17.1, 6.6, 1…
$ hs <dbl> 25.1, 33.9, 30.6, 27.8, 29.8, 37.2, 30.7,…
$ somecollege <dbl> 19.7, 18.8, 20.9, 29.2, 21.4, 23.5, 25.7,…
$ associate <dbl> 8.2, 8.0, 7.6, 10.1, 7.9, 7.3, 8.0, 4.1, …
$ bachelor <dbl> 25.3, 5.5, 12.7, 10.0, 13.7, 5.9, 17.6, 7…
$ grad <dbl> 13.5, 3.1, 5.1, 5.4, 9.8, 2.0, 8.7, 2.9, …
$ pov <dbl> 6.1, 19.5, 19.0, 8.8, 15.6, 25.5, 7.3, 8.…
$ hs_orless <dbl> 33.3, 64.6, 53.7, 45.4, 47.2, 61.4, 40.0,…
$ urc2013 <dbl> 4, 6, 4, 4, 4, 1, 1, 1, 1, 1, 2, 3, 3, 3,…
$ aod <dbl> 37.363636, 34.818182, 36.000000, 43.41666…
$ state_Not.California <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ city_Not.in.a.city <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0,…
Notice that it looks like we gained the log_prisec_length_25000 variable back with this recipe, using the data with our changes to state and city.
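Now we can bake the test data with the new prepped recipe (object names assumed as above):

```r
# Apply the updated preprocessing to the test data
baked_test_pm <- bake(prepped_novel_rec, new_data = test)
glimpse(baked_test_pm)
```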
Rows: 292
Columns: 37
$ id <fct> 1049.1003, 1073.101, 1073.2006, 1089.0014…
$ value <dbl> 11.659091, 13.114545, 12.228125, 12.23294…
$ fips <fct> 1049, 1073, 1073, 1089, 1103, 1121, 4013,…
$ lat <dbl> 34.28763, 33.54528, 33.38639, 34.68767, 3…
$ lon <dbl> -85.96830, -86.54917, -86.81667, -86.5863…
$ CMAQ <dbl> 8.534772, 9.303766, 10.235612, 9.343611, …
$ zcta_area <dbl> 203836235, 148994881, 56063756, 46963946,…
$ zcta_pop <dbl> 8300, 14212, 32390, 21297, 30545, 7713, 5…
$ imp_a500 <dbl> 5.78200692, 0.06055363, 42.42820069, 23.2…
$ imp_a15000 <dbl> 0.9730444, 2.9956557, 12.7487614, 10.3555…
$ county_area <dbl> 2012662359, 2878192209, 2878192209, 20761…
$ county_pop <dbl> 71109, 658466, 658466, 334811, 119490, 82…
$ log_dist_to_prisec <dbl> 3.721489, 7.301545, 4.721755, 4.659519, 6…
$ log_pri_length_5000 <dbl> 8.517193, 9.683336, 10.737240, 8.517193, …
$ log_pri_length_25000 <dbl> 11.90959, 12.53777, 12.99669, 11.47391, 1…
$ log_prisec_length_500 <dbl> 7.310155, 6.214608, 7.528913, 8.760549, 6…
$ log_prisec_length_1000 <dbl> 8.585843, 7.600902, 9.342290, 9.543183, 8…
$ log_prisec_length_5000 <dbl> 10.214200, 11.262645, 11.713190, 11.48606…
$ log_prisec_length_10000 <dbl> 11.50894, 12.14101, 12.53899, 12.68440, 1…
$ log_prisec_length_25000 <dbl> 13.55979, 14.08915, 14.27363, 13.87170, 1…
$ log_nei_2008_pm10_sum_15000 <dbl> 3.3500444, 6.6241114, 5.8268686, 3.861625…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.1719202, 7.5490587, 8.8205542, 5.219092…
$ popdens_county <dbl> 35.330814, 228.777633, 228.777633, 161.26…
$ popdens_zcta <dbl> 40.718962, 95.385827, 577.735106, 453.475…
$ nohs <dbl> 14.3, 7.2, 0.8, 1.2, 4.8, 16.7, 19.1, 6.4…
$ somehs <dbl> 16.7, 12.2, 2.6, 3.1, 7.8, 33.3, 15.6, 9.…
$ hs <dbl> 35.0, 32.2, 12.9, 15.1, 28.7, 37.5, 26.5,…
$ somecollege <dbl> 14.9, 19.0, 17.9, 20.5, 25.0, 12.5, 18.0,…
$ associate <dbl> 5.5, 6.8, 5.2, 6.5, 7.5, 0.0, 6.0, 8.8, 3…
$ bachelor <dbl> 7.9, 14.8, 35.5, 30.4, 18.2, 0.0, 10.6, 1…
$ grad <dbl> 5.8, 7.7, 25.2, 23.3, 8.0, 0.0, 4.1, 5.7,…
$ pov <dbl> 13.8, 10.5, 2.1, 5.2, 8.3, 18.8, 21.4, 14…
$ hs_orless <dbl> 66.0, 51.6, 16.3, 19.4, 41.3, 87.5, 61.2,…
$ urc2013 <dbl> 6, 1, 1, 3, 4, 5, 1, 2, 5, 4, 4, 6, 6, 1,…
$ aod <dbl> 33.08333, 42.45455, 44.25000, 42.41667, 4…
$ state_Not.California <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,…
$ city_Not.in.a.city <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
Great, now we no longer have NA values! :)
Note: if you use the skip option for some of the preprocessing steps, be careful: juice() will show all of the results, ignoring skip = TRUE, while bake() will not necessarily conduct these steps on the new data.
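For example, a step that should only apply to the training data (such as a row-filtering step) can be marked with skip = TRUE; this is a hypothetical sketch, not part of this case study's recipe:

```r
library(recipes)

# Hypothetical recipe: the filter step runs during prep()/juice(),
# but bake() will skip it when applied to new data.
rec <- recipe(value ~ ., data = train) %>%
  step_filter(!is.na(value), skip = TRUE) %>%
  step_normalize(all_numeric_predictors())
```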